Lab 1: Basic Data Wrangling and Plotting, Distributions



Submission Instructions

  • This lab will be submitted in pairs (if you don’t have a pair, please contact us) via the submission link in moodle.

  • Your final submission should include two files: an Rmd file (with your answers filled-in) and an html file that was generated automatically by knitting the Rmd file using knitr. Name your files as <ID1>_<ID2>.Rmd and <ID1>_<ID2>.html (insert your ID numbers instead).

  • Grading: There are \(8\) questions with overall \(15\) sub-questions. Each sub-question is worth \(6\frac{2}{3}\) points to the overall lab grade. The questions vary in length and difficulty level. It is recommended to start with the simpler and shorter questions. Points may be reduced for incorrect naming of files, missing parts and problems in knitting the Rmd file and general appearance of the report.

  • Libraries: The only allowed libraries are listed below (do not add additional libraries without permission from the course staff):

library(tidyverse) # This includes dplyr, stringr, ggplot2, .. 
library(data.table)
library(rworldmap) # world map
library(ggthemes)
library(reshape2) # melt: change data-frame format long/wide
library(e1071) # skewness and kurtosis
library(rvest)
library(corrplot)
library(moments)
library(spatstat.geom)



Analysis of the World Democracy Index Dataset

The wikipedia/Democracy_Index website hosts world-wide data on different measurements of democracy index for world countries. For more information about it, please visit here.

We will focus on analyzing the changes in the index in different countries, as well as the individual components comprising the index, and comparison to other datasets.

General Guidance

  • Your solution should be submitted as a full report integrating text, code, figures and tables. For each question, describe first in the text of your solution what you’re trying to do, then include the relevant code, then the results (e.g. figures/tables) and then a textual description of them.

  • In most questions the extraction/manipulation of relevant parts of the data-frame can be performed using commands from the tidyverse and dplyr R packages, such as head, arrange, aggregate, group_by, filter, select, summarise, mutate etc.

  • When displaying tables, show the relevant columns and rows with meaningful names, and describe the results.

  • When displaying figures, make sure that the figure is clear to the reader, axis ranges are appropriate, labels for the axis, title and different curves/bars are displayed clearly (font sizes are large enough), a legend is shown when needed etc. Explain and describe in text what is shown in the figure.

  • It could be that in some cases data are missing (e.g. NA). Make sure that all your calculations (e.g. taking the maximum, average, correlation etc.) take this into account. Specifically, the calculations should ignore the missing values to allow us to compute the desired results for the rest of the values (for example, using the option na.rm = TRUE or use = "complete.obs").
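As a minimal sketch of the missing-value guidance above, the following base-R snippet uses two made-up vectors (they are illustrative only and not part of the lab data) to show how `na.rm = TRUE` and `use = "complete.obs"` behave.

```r
# Toy vectors with missing values (made-up numbers, not lab data)
x <- c(1, 2, NA, 4, 5)
y <- c(2, 4, 6, NA, 10)

# Without na.rm, a single missing value makes the result NA
mean_with_na <- mean(x)

# na.rm = TRUE drops the missing values before computing
mean_x <- mean(x, na.rm = TRUE)
max_x  <- max(x, na.rm = TRUE)

# For correlations, use = "complete.obs" keeps only the pairs in which
# both values are present: here (1, 2), (2, 4) and (5, 10)
cor_xy <- cor(x, y, use = "complete.obs")
```

Here `mean_with_na` is `NA`, while `mean_x` and `max_x` are computed from the four observed values only.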

Questions:

  1. Loading data and basic processing:
    1. Load the democracy-index html into an R object, using the rvest package.
      Next, extract the three tables shown in the web-page as List by region, List by country and components into three separate R data-frames. Display the top five rows of each table to check that they were loaded correctly.
    2. Display in a new table the top five countries in terms of the democracy index in 2022. Show only the country name and the democracy index.
      Repeat the same with the five bottom countries in 2022.
      Repeat the same with the five top and bottom countries according to the average index value of all the \(15\) years between 2006 and 2022 for which data is available in the table List by country.
  2. Plotting distributions of groups of countries:
    1. Make one figure showing seven boxplots representing the distributions of the democracy index in 2022 of the different world regions given in the List by country table (each boxplot should represent the distribution of all countries within a specific region).
      Next, for each region that has at least one outlier country, find and list all the outliers that appear in the boxplot.
      (Hint: You may use the boxplot.stats command).
    2. Make a figure showing density plots for the same distributions of the democracy index in 2022 in the seven different regions. Do the densities resemble the Normal distribution? Compute the mean, variance, skewness and kurtosis for all the distributions, display them in a table and explain what they mean about the empirical distribution of the data.
  3. Comparing countries and showing trends in democracy index:
    1. Write a function that receives as input a data-frame and a vector of country names (as strings). The function plots the values of the democracy index of these countries as a function of the year (from 2006 to 2022), shown on the same graph as curves with different colors or symbols. Use meaningful axis and plot labels, and add an informative legend. Use the function to plot the democracy index for five countries of your choice.
      Use the same function for the table List by region, where the seven region names are inserted as input instead of countries, to show changes in the world regions' democracy index over time.

    2. Divide the countries into eight separate groups (clusters) as follows:

  • First, countries whose index increased (one cluster) or decreased (another cluster) by at least \(1.5\) points between 2006 and 2022.
  • Second, countries whose index increased (one cluster) or decreased (another cluster) by between \(0.75\) to \(1.5\) points between 2006 and 2022.
  • Next, countries that dropped by at least \(0.75\) points after \(2006\), and then recovered by at least \(0.75\) points in \(2022\) compared to the lowest drop.
  • Similarly, countries that increased by at least \(0.75\) points after \(2006\), and then dropped by at least \(0.75\) points in \(2022\) compared to the highest point.
  • Next, countries that had barely changed from \(2006\) to \(2022\), i.e. that the difference between their highest and lowest index was less than \(0.5\) points.
  • Finally, all other countries.
    For each of the eight groups of countries, plot their changes using the function from 3.a. Describe the patterns you see in the different groups.

Remark: Don’t worry if some of the groups you get are large with countries with very similar colors, and/or a small graph panel due to a large legend.

  4. Change in category:
    For each of the four different regime types (Full democracy, Flawed democracy, Hybrid regime, Authoritarian), use the countries' democracy index data frame to estimate the probability that a country moves from that regime in \(2006\) to each of the four regimes in \(2022\). Show the results (sixteen estimated probabilities) in a \(4\)-by-\(4\) table, and also in a heatmap.
    Remarks: Your estimates should simply be the empirical frequencies. For example, if \(2\) out of \(20\) countries moved from Authoritarian in \(2006\) to Hybrid regime in \(2022\), then the estimate of the probability of such a regime change is \(0.1\).
    Use the table By regime type from the democracy index webpage to determine the regime type category based on the democracy index value.

  5. Joining data from additional tables:

    1. Load tables for the gdp, population size, incarceration rates and area web pages using the rvest library.
      For each of the four web pages, extract the table for countries into an R data-frame.
      Join the table of the democracy index by country with these four tables, using the country names for joining. Display the top five rows of the joined table.
    2. Fit a simple linear regression model with the democracy index at \(2022\) as the predictor and GDP (PPP) per capita (use the CIA estimates) as the response, and report the regression results.
      Next, make a scatter plot of the GDP (PPP) per capita (y-axis) vs. the democracy index at \(2022\), with the fitted regression line. Describe your results.
      Repeat the same steps with incarceration rate (per 100,000) as the response.
  6. Empirical Cumulative Distribution Function (CDF):

    1. Let \(X\) be a random variable representing the GDP (PPP) per capita of a randomly selected country in 2022, where countries are selected uniformly at random from all world countries. Compute and plot the empirical CDF of \(X\).
    2. Let \(Y\) be a random variable representing the GDP (PPP) per capita of a randomly selected person in the world in 2022, where a person is selected uniformly at random from all world population. Compute and plot the empirical CDF for \(Y\) and explain the differences from the distribution for \(X\). Remark: Use the population size data to compute the empirical CDF for this case. It is possible to use the library spatstat.geom.
    3. Let \(Z\) be a random variable representing the GDP (PPP) per capita of a randomly selected person in the world in 2022, where the location of the person is selected uniformly at random from all the land area on earth. Compute and plot the empirical CDF for \(Z\) and explain the differences from the distribution for \(X\). Compare the median, and the \(25\%\) and \(75\%\) percentiles of \(X,Y\) and \(Z\). Are they similar or different? Explain. Remark: Use the countries' land area (in \(km^2\) or \(mi^2\)) to compute the empirical CDF for this case. You will need to parse the corresponding column to get the numerical data.
  7. Displaying data on the world map:
    Use the rworldmap package to display the world map and color each country based on the average democracy index across the years from \(2006\) to \(2022\). Describe the resulting map in a couple of sentences.
    Next, repeat all parts above, but this time display in the map the difference in the index between \(2022\) and \(2006\).

    Guidance: Use the joinCountryData2Map and mapCountryData commands to make the plots. Keep countries with missing data in white.

  8. Components of the Democracy Index:

    1. Join the components table with the main table from the previous questions. Display the top five rows. Next, compute the correlation between all pairs of the five democracy components (Electoral process and pluralism, Functioning of government, Political participation, Political culture and Civil liberties), and plot the resulting \(5\)-by-\(5\) correlations matrix in a heatmap. (It is possible to use the corrplot library).
    2. Run multiple linear regression where the covariates are the different democracy sub-indices from the components table, and the response variable that you try to predict is the GDP (PPP) per capita of each country.
      Show a summary of your regression analysis. What coefficients are significant at significance level \(\alpha=0.01\)?
      What countries are outliers? Display the five countries with the highest and lowest residuals in a table. Can you think of other factors contributing to their high/low GDP (PPP) per capita?

Good luck!

Solution: (Fill code, text, plots etc.)

1.a. Loading the data via URL connection:

democracy <- read_html("https://en.wikipedia.org/wiki/Democracy_Index")
all.tables = html_nodes(democracy, "table")  

# Use html_table to extract the individual tables from the all.tables object:
categories <- as.data.frame(html_table(all.tables[3], fill = TRUE))  # Example 

# Following the example, extract the relevant tables in order and name them accordingly
list_by_region <- as.data.frame(html_table(all.tables[4], fill = TRUE))
list_by_region22 <- as.data.frame(html_table(all.tables[5], fill = TRUE))
list_by_country <- as.data.frame(html_table(all.tables[6], fill = TRUE))
components <- as.data.frame(html_table(all.tables[7], fill = TRUE))

# Display the first rows of each table to check that they were loaded correctly
head(categories)
##          Type.of.regime                                        Score Countries
## 1        Type of regime                                        Score    Number
## 2      Full democracies                        9.01–10.00  8.01–9.00        24
## 3    Flawed democracies                         7.01–8.00  6.01–7.00        48
## 4        Hybrid regimes                         5.01–6.00  4.01–5.00        36
## 5 Authoritarian regimes 3.01–4.00  2.01–3.00   1.01–2.00   0.00–1.00        59
##   Countries.1 Proportion.ofWorld.population....
## 1         (%) Proportion ofWorld population (%)
## 2       14.4%                              8.0%
## 3       28.7%                             37.3%
## 4       21.6%                             17.9%
## 5       35.3%                             36.9%
head(list_by_region)
##                            Region Coun.tries X2022 X2021 X2020 X2019 X2018
## 1                   North America          2  8.37  8.36  8.58  8.59  8.56
## 2                  Western Europe         21  8.36  8.23  8.29  8.35  8.35
## 3 Latin America and the Caribbean         24  5.79  5.83  6.09  6.13  6.24
## 4            Asia and Australasia         28  5.46  5.46  5.62  5.67  5.67
## 5      Central and Eastern Europe         28  5.39  5.36  5.36  5.42  5.42
## 6              Sub-Saharan Africa         44  4.14  4.12  4.16  4.26  4.36
##   X2017 X2016 X2015 X2014 X2013 X2012 X2011 X2010 X2008 X2006
## 1  8.56  8.56  8.56  8.59  8.59  8.59  8.59  8.63  8.64  8.64
## 2  8.38  8.40  8.42  8.41  8.41  8.44  8.40  8.45  8.61  8.60
## 3  6.26  6.33  6.37  6.36  6.38  6.36  6.35  6.37  6.43  6.37
## 4  5.63  5.74  5.74  5.70  5.61  5.56  5.51  5.53  5.58  5.44
## 5  5.40  5.43  5.55  5.58  5.53  5.51  5.50  5.55  5.67  5.76
## 6  4.35  4.37  4.38  4.34  4.36  4.33  4.32  4.23  4.28  4.24
head(list_by_country)
##           Region X2022.rank       Country      Regime.type X2022 X2021 X2020
## 1  North America         12        Canada   Full democracy  8.88  8.87  9.24
## 2  North America         30 United States Flawed democracy  7.85  7.85  7.92
## 3 Western Europe         20       Austria   Full democracy  8.20  8.07  8.16
## 4 Western Europe         36       Belgium Flawed democracy  7.64  7.51  7.51
## 5 Western Europe         37        Cyprus Flawed democracy  7.38  7.43  7.56
## 6 Western Europe          6       Denmark   Full democracy  9.28  9.09  9.15
##   X2019 X2018 X2017 X2016 X2015 X2014 X2013 X2012 X2011 X2010 X2008 X2006
## 1  9.22  9.15  9.15  9.15  9.08  9.08  9.08  9.08  9.08  9.08  9.07  9.07
## 2  7.96  7.96  7.98  7.98  8.05  8.11  8.11  8.11  8.11  8.18  8.22  8.22
## 3  8.29  8.29  8.42  8.41  8.54  8.54  8.48  8.62  8.49  8.49  8.49  8.69
## 4  7.64  7.78  7.78  7.77  7.93  7.93  8.05  8.05  8.05  8.05  8.16  8.15
## 5  7.59  7.59  7.59  7.65  7.53  7.40  7.29  7.29  7.29  7.29  7.70  7.60
## 6  9.22  9.22  9.22  9.20  9.11  9.11  9.38  9.52  9.52  9.52  9.52  9.52
head(components)
##               Rank
## 1                 
## 2 Full democracies
## 3                1
## 4                2
## 5                3
## 6                4
##   .mw.parser.output..tooltip.dotted.border.bottom.1px.dotted.cursor.help.Δ.Rank
## 1                                                                              
## 2                                                              Full democracies
## 3                                                                              
## 4                                                                              
## 5                                                                             2
## 6                                                                              
##            Country      Regime.type    Overall.score          Δ.Score
## 1                                                                    
## 2 Full democracies Full democracies Full democracies Full democracies
## 3           Norway   Full democracy             9.81             0.06
## 4      New Zealand   Full democracy             9.61             0.14
## 5          Iceland   Full democracy             9.52             0.34
## 6           Sweden   Full democracy             9.39             0.13
##   Elec.toral.pro.cessand.plura.lism Func.tioningof.govern.ment
## 1                                                             
## 2                  Full democracies           Full democracies
## 3                             10.00                       9.64
## 4                             10.00                       9.29
## 5                             10.00                       9.64
## 6                              9.58                       9.64
##   Poli.ticalpartici.pation Poli.ticalcul.ture  Civilliber.ties
## 1                                                             
## 2         Full democracies   Full democracies Full democracies
## 3                    10.00              10.00             9.41
## 4                    10.00               8.75            10.00
## 5                     8.89               9.38             9.71
## 6                     8.33              10.00             9.41
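The components table printed above contains an empty first row and a "Full democracies" separator row, and its scores are stored as text. A minimal sketch of one way to clean such a table is shown below; the data frame `raw` is a small hand-made stand-in (with values copied from the printed rows), and the names `raw`, `score` and `clean` are illustrative rather than part of the solution.

```r
# Toy stand-in for a scraped table (values copied from the printed rows above)
raw <- data.frame(
  Rank          = c("", "Full democracies", "1", "2"),
  Country       = c("", "Full democracies", "Norway", "New Zealand"),
  Overall.score = c("", "Full democracies", "9.81", "9.61"),
  stringsAsFactors = FALSE
)

# Entries that are not numbers become NA; suppressWarnings hides the
# coercion warning that as.numeric raises for the separator row
score <- suppressWarnings(as.numeric(raw$Overall.score))

# Keep only the real country rows and store the score as a number
clean <- raw[!is.na(score), ]
clean$Overall.score <- score[!is.na(score)]
rownames(clean) <- NULL
```

The same idea could be applied to each score column before computing correlations or fitting regressions on the components.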

1.b.

# Select the top 5 countries with the highest democracy index in 2022
# (head() returns 6 rows by default, so the number of rows is given explicitly)
top <- select(list_by_country, Country, X2022) %>% arrange(desc(X2022)) %>% head(5)

# Select the bottom 5 countries with the lowest democracy index in 2022
bottom <- select(list_by_country, Country, X2022) %>% arrange(X2022) %>% head(5)

# Display the top and bottom countries
top
##       Country X2022
## 1      Norway  9.81
## 2 New Zealand  9.61
## 3     Iceland  9.52
## 4      Sweden  9.39
## 5     Finland  9.29
bottom
##                    Country X2022
## 1              Afghanistan  0.32
## 2                  Myanmar  0.74
## 3              North Korea  1.08
## 4 Central African Republic  1.35
## 5                    Syria  1.43
# Calculate the average democracy index for each country
# (columns 5 to 19 hold the 15 yearly scores, X2022 down to X2006)
averages <- list_by_country %>%
  mutate(avg = rowMeans(select(., 5:19), na.rm = TRUE)) %>%
  select(Country, avg)

# Select the 5 countries with the highest and the 5 with the lowest average democracy index
top_averages <- averages %>% arrange(desc(avg)) %>% head(5)
bottom_averages <- averages %>% arrange(avg) %>% head(5)

# Display the top and bottom countries based on average democracy index
top_averages
##       Country      avg
## 1      Norway 9.830667
## 2     Iceland 9.562000
## 3      Sweden 9.524667
## 4     Denmark 9.305333
## 5 New Zealand 9.268667
bottom_averages
##                    Country      avg
## 1              North Korea 1.062000
## 2                     Chad 1.569333
## 3 Central African Republic 1.581333
## 4                    Syria 1.700667
## 5             Turkmenistan 1.741333
  • The top five countries in terms of the democracy index in 2022 are: Norway, New Zealand, Iceland, Sweden and Finland.

  • The bottom five countries in terms of the democracy index in 2022 are: Afghanistan, Myanmar, North Korea, Central African Republic and Syria.

  • The top five countries according to the average index value over all 15 years are: Norway, Iceland, Sweden, Denmark and New Zealand. Four of them also appear in the 2022 top five, in a different order; Denmark takes the place of Finland.

  • The bottom five countries according to the average index value over all 15 years are: North Korea, Chad, Central African Republic, Syria and Turkmenistan.
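As a small base-R sketch of the same ranking idea, the snippet below sorts a toy table and takes exactly five rows. The data frame `scores` is hand-made from a few of the 2022 values printed above; the variable names are illustrative only.

```r
# Toy table (values taken from the 2022 column printed above)
scores <- data.frame(
  Country = c("Norway", "New Zealand", "Iceland", "Sweden",
              "Finland", "Denmark", "Afghanistan"),
  X2022   = c(9.81, 9.61, 9.52, 9.39, 9.29, 9.28, 0.32),
  stringsAsFactors = FALSE
)

# Sort from the highest to the lowest score
ranked <- scores[order(scores$X2022, decreasing = TRUE), ]

# head() and tail() return 6 rows unless n is given explicitly
top5    <- head(ranked, 5)
bottom1 <- tail(ranked, 1)
```

Passing the number of rows explicitly keeps the table in line with a question that asks for exactly five countries.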

2.a.

# Create a boxplot showing the distribution of democracy index in 2022 by region
ggplot(list_by_country, aes(x = Region, y = `X2022`, fill = Region)) +
  geom_boxplot() +
  xlab("Region") +
  ylab("Democracy Index") +
  ggtitle("Boxplots by Region") +
  theme(axis.title.x = element_text(margin = margin(t = 10)),
        axis.text.x = element_text(angle = 45, hjust = 1.1, vjust = 1),
        plot.title = element_text(face = "bold", hjust = 0.5))

# Find the outliers for each region
outliers <-list_by_country %>% group_by(Region) %>% summarise(outliers = paste(boxplot.stats(`X2022`)$out, collapse = ", "))

# Display the outliers for each region
outliers
## # A tibble: 7 × 2
##   Region                          outliers
##   <chr>                           <chr>   
## 1 Asia and Australasia            ""      
## 2 Central and Eastern Europe      ""      
## 3 Latin America and the Caribbean ""      
## 4 Middle East and North Africa    "7.93"  
## 5 North America                   ""      
## 6 Sub-Saharan Africa              ""      
## 7 Western Europe                  "4.35"
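The table above reports the outlier values; since the question asks for the outlier countries, one possible way to recover the country names is sketched below. It is a base-R illustration on made-up data (the regions "A" and "B" and their scores are hypothetical), applying `boxplot.stats` within each region and keeping the matching rows.

```r
# Made-up data: region "A" has one clearly extreme value, region "B" has none
toy <- data.frame(
  Region  = c(rep("A", 6), rep("B", 5)),
  Country = c(paste0("A", 1:6), paste0("B", 1:5)),
  X2022   = c(3.0, 3.1, 3.2, 3.3, 3.4, 7.9, 5.0, 5.5, 6.0, 6.5, 7.0),
  stringsAsFactors = FALSE
)

# For each region, keep the rows whose score is flagged by boxplot.stats
# (points beyond 1.5 times the hinge spread from the box)
outlier_rows <- do.call(rbind, lapply(split(toy, toy$Region), function(d) {
  d[d$X2022 %in% boxplot.stats(d$X2022)$out, c("Region", "Country", "X2022")]
}))
rownames(outlier_rows) <- NULL
```

On the real data, the same pattern applied to `list_by_country` would return the countries behind the values 7.93 and 4.35.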
  • Asia and Australasia has the largest gap between its minimum and maximum index, which indicates that the values in this region span a wide range. Its median is approximately 6.25, the lower quartile is approximately 3.75 and the upper quartile is approximately 7.25. The majority of countries in this region are flawed democracies. It does not have any outliers.

  • For Central and Eastern Europe, the box (the interquartile range) is similar in size to that of the previous region, but the gap between the maximum and the minimum is smaller. Its median is approximately 6.25, the lower quartile is approximately 3.60 and the upper quartile is approximately 7. The majority of countries in this region are flawed democracies. It does not have any outliers. This region and the previous one are very much alike.

  • For Latin America and the Caribbean, the box is much smaller than in the two previous regions, which indicates that the middle half of the values is clustered closely together, but the gap between the minimum and the maximum index is large. Its median is approximately 6.25, the lower quartile is approximately 5 and the upper quartile is approximately 7.1. The majority of countries in this region are flawed democracies. It does not have any outliers.

  • For the Middle East and North Africa, the box is smaller than in the previous region, which indicates that the values are clustered closely together and have limited variability; the gap between the minimum and the maximum index is also small. Its median is the lowest of all regions, approximately 3.1, the lower quartile is approximately 2.5 and the upper quartile is approximately 3.75. The majority of countries in this region are authoritarian regimes. It has one outlier, at 7.93.

  • For North America, the box is smaller than in every other region because the region contains only two countries, and these two countries have similar scores. The median is approximately 8.4, and the lower quartile, upper quartile, minimum and maximum are all between 7.75 and 9. One of the two countries is a full democracy (Canada) and the other a flawed democracy (United States). It does not have any outliers.

  • For Sub-Saharan Africa, the box is similar in size to those of the first two regions, which suggests considerable variability of the index within the region. Its median is approximately 3.75, the lower quartile is approximately 3.1 and the upper quartile is approximately 5.3. The democracy index ranges from about 1.25 to 8. The majority of countries in this region are authoritarian regimes. It does not have any outliers.

  • For Western Europe, the box is smaller than in the previous region, which indicates that the values are clustered closely together and have limited variability; apart from the outlier, the index ranges from about 7.4 to 9.9. Its median is approximately 8, the lower quartile is approximately 7.8 and the upper quartile is approximately 8.9. The majority of countries in this region are full democracies. This region looks a lot like North America. It has one outlier, at 4.35.

2.b.

# Create a density plot to visualize the distribution of the democracy index in 2022 by region
ggplot(list_by_country, aes(x = `X2022`, fill = Region)) +
  geom_density(alpha = 0.2) +
  xlab("Democracy Index 2022") +
  ylab("Density") +
  ggtitle("Density Plot by Region") +
  theme(legend.position = "top")

# Compute summary statistics for each region
summary_table <- list_by_country %>%
  group_by(Region) %>%
  summarize(
    Mean = mean(`X2022`),
    Variance = var(`X2022`),
    Skewness = moments::skewness(`X2022`),
    Kurtosis = moments::kurtosis(`X2022`)
  ) 

summary_table
## # A tibble: 7 × 5
##   Region                           Mean Variance Skewness Kurtosis
##   <chr>                           <dbl>    <dbl>    <dbl>    <dbl>
## 1 Asia and Australasia             5.46    6.68    -0.529     2.32
## 2 Central and Eastern Europe       5.39    4.23    -0.649     1.99
## 3 Latin America and the Caribbean  5.79    3.40    -0.504     2.52
## 4 Middle East and North Africa     3.34    2.22     1.54      5.65
## 5 North America                    8.36    0.530    0         1   
## 6 Sub-Saharan Africa               4.14    3.18     0.506     2.37
## 7 Western Europe                   8.36    1.36    -1.86      7.64
  • Skewness measures the asymmetry of a distribution. A skewness of 0 indicates a distribution that is symmetric around its mean; for a symmetric unimodal distribution such as the normal distribution, the mean, median and mode are equal, but symmetry alone does not imply normality. In our table the skewness of North America is 0, which is simply a consequence of the region containing only two countries (two points are always symmetric around their mean). If the skewness is greater than 0, the distribution is right-skewed, meaning that the right tail is longer than the left tail. If the skewness is less than 0, the distribution is left-skewed, meaning that the left tail is longer than the right tail.

The right-skewed regions are Middle East and North Africa and Sub-Saharan Africa. The left-skewed regions are Asia and Australasia, Central and Eastern Europe, Latin America and the Caribbean and Western Europe.

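  • Kurtosis measures how heavy the tails of a distribution are (and, loosely, how peaked it is) compared to a normal distribution. The moments::kurtosis function used here returns the non-excess kurtosis, for which a normal distribution has the value 3 (not 0). If the kurtosis is greater than 3, the distribution has heavier tails and a sharper peak than the normal distribution. If the kurtosis is less than 3, it has lighter tails and is flatter than a normal distribution.

In our case only Middle East and North Africa (5.65) and Western Europe (7.64) have a kurtosis greater than 3, which agrees with the outliers found for these two regions in the boxplots. All the other regions have a kurtosis below 3, so their distributions are flatter than the normal distribution. The value of 1 for North America is again a consequence of the region containing only two countries.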

As we saw in the previous question, Asia and Australasia has a large variance, whereas North America has a very low variance. Furthermore, based on the regional means, the average country in Asia and Australasia, Central and Eastern Europe, Latin America and the Caribbean and Sub-Saharan Africa falls in the hybrid regime range; the average country in the Middle East and North Africa falls in the authoritarian range; and the average country in North America and Western Europe falls in the full democracy range.
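To make the two shape statistics concrete, here is a base-R sketch of the standard moment ratios. The helper names `central_moment`, `skew` and `kurt` are made up for this illustration; the printed North America row (skewness 0, kurtosis 1) is consistent with these formulas, which is why the kurtosis here is taken to be the non-excess version (normal distribution = 3).

```r
# k-th central moment: the average of (x - mean)^k
central_moment <- function(x, k) mean((x - mean(x))^k)

# Moment-based skewness and non-excess kurtosis
skew <- function(x) central_moment(x, 3) / central_moment(x, 2)^1.5
kurt <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

# Two-point sample, like a region with only two countries
# (the 2022 scores of the United States and Canada from the table above)
two_points <- c(7.85, 8.88)
skew_two <- skew(two_points)  # 0 up to rounding: two points are symmetric
kurt_two <- kurt(two_points)  # 1 up to rounding

# Large simulated normal sample: skewness near 0, kurtosis near 3
set.seed(1)
z <- rnorm(1e5)
skew_norm <- skew(z)
kurt_norm <- kurt(z)
```

Any sample of exactly two distinct values gives skewness 0 and kurtosis 1, so those entries in the summary table say nothing about the shape of the North American distribution.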

3.a.

# Define a function to plot the democracy index over time for a set of countries (or regions)
Comparing_countries <- function(data, country_names) {

  # Use the "Country" column when the data has one, otherwise use "Region"
  if ("Country" %in% colnames(data)) {
    place_col <- "Country"
  } else {
    place_col <- "Region"
  }

  # Extract the year columns (named X2006, ..., X2022) and the matching years
  year_cols <- grep("^X[0-9]{4}$", colnames(data), value = TRUE)
  years <- as.numeric(substr(year_cols, 2, 5))

  # Filter the data for the specified names and choose one color per name
  fil_data <- data[data[[place_col]] %in% country_names, ]
  color <- rainbow(length(country_names))

  # Create an empty plot with appropriate axis labels and title;
  # xaxt = "n" suppresses the default x-axis so the yearly ticks below do not overlap it
  plot(NULL, xlim = range(years), ylim = c(0, 10), xaxt = "n",
       xlab = "Year", ylab = "Democracy Index",
       main = paste("Democracy Index by", place_col))

  # Set the x-axis ticks to the years
  axis(1, at = years, labels = years)

  # Loop through the names and draw each democracy index curve over the years
  for (i in seq_along(country_names)) {
    country_data <- fil_data[fil_data[[place_col]] == country_names[i], ]
    lines(years, as.numeric(unlist(country_data[1, year_cols])), col = color[i])
  }

  # Add a legend with line symbols (lty) that match the plotted curves
  legend("bottomright", inset = 0.02, legend = country_names,
         col = color, lty = 1, bty = "n")

}
   
# Call the Comparing_countries function with a specific set of country names
Comparing_countries(list_by_country, c("France", "United States", "Israel", "Cameroon", "Morocco"))

First, we can observe three countries that look alike: the democracy indexes of France, the United States and Israel did not change much over the years and stayed high. Second, the two other countries, Cameroon and Morocco, looked alike from 2006 to 2014, but from 2015 they moved in opposite directions: Morocco became slightly more democratic and Cameroon slightly less.

3.b.

# Calculate the change in democracy index between 2006 and 2022
# (scores have two decimals, so rounding avoids floating-point edge cases at the thresholds)
index_diff <- round(list_by_country$X2022 - list_by_country$X2006, 2)
# First cluster - Countries whose index increased by at least 1.5 points
big_increase_cluster <- list_by_country[index_diff >= 1.5, "Country"]
# Second cluster - Countries whose index decreased by at least 1.5 points
big_decrease_cluster <- list_by_country[index_diff <= -1.5, "Country"]
# Third cluster - Countries whose index increased by between 0.75 and 1.5 points
small_increase_cluster <- list_by_country[index_diff >= 0.75 & index_diff < 1.5, "Country"]
# Fourth cluster - Countries whose index decreased by between 0.75 and 1.5 points
small_decrease_cluster <- list_by_country[index_diff <= -0.75 & index_diff > -1.5, "Country"]

# Year columns (X2006, ..., X2022); the minimum and maximum are taken over all of them
year_cols <- grep("^X[0-9]{4}$", colnames(list_by_country), value = TRUE)

# Calculate the minimum index for each country
list_by_country$min_index <- apply(list_by_country[, year_cols], 1, min, na.rm = TRUE)

# Fifth cluster - Countries that dropped at least 0.75 points below their 2006 value
# and by 2022 recovered to at least 0.75 points above their lowest value
index_min_diff_2006 <- list_by_country$X2006 - list_by_country$min_index
index_min_diff_2022 <- list_by_country$X2022 - list_by_country$min_index
decrease_increase <- list_by_country$Country[index_min_diff_2006 >= 0.75 & index_min_diff_2022 >= 0.75]

# Calculate the maximum index for each country
list_by_country$max_index <- apply(list_by_country[, year_cols], 1, max, na.rm = TRUE)

# Sixth cluster - Countries that rose at least 0.75 points above their 2006 value
# and by 2022 dropped to at least 0.75 points below their highest value
index_max_diff_2006 <- list_by_country$X2006 - list_by_country$max_index
index_max_diff_2022 <- list_by_country$X2022 - list_by_country$max_index
increase_decrease <- list_by_country$Country[index_max_diff_2006 <= -0.75 & index_max_diff_2022 <= -0.75]

# Seventh cluster - Countries with minimal change in democracy index (max minus min below 0.5)
bare_change <- list_by_country$Country[list_by_country$max_index - list_by_country$min_index < 0.5]

# Eighth cluster - Other countries not included in the previous clusters
other_countries <- list_by_country$Country[!(list_by_country$Country %in% c(big_increase_cluster, big_decrease_cluster, small_increase_cluster, small_decrease_cluster, decrease_increase, increase_decrease, bare_change))]

# Call the Comparing_countries function to compare countries within each cluster
Comparing_countries(list_by_country, big_increase_cluster)

Comparing_countries(list_by_country, big_decrease_cluster)

Comparing_countries(list_by_country, small_increase_cluster)

Comparing_countries(list_by_country, small_decrease_cluster)

Comparing_countries(list_by_country, decrease_increase)

Comparing_countries(list_by_country, increase_decrease)

Comparing_countries(list_by_country, bare_change)

Comparing_countries(list_by_country, other_countries)

  • First cluster: the three countries began with a low index, between 2 and 3, in 2006 and all ended with an index between 3 and 6 in 2022. Tunisia reached the highest value, peaking in 2015.

  • Second cluster: 13 countries that all finished with a much lower democracy index in 2022 than in 2006. Afghanistan had the largest drop.

  • Third cluster: 14 countries that all finished with a slightly higher democracy index in 2022 than in 2006. For example, Uruguay began with an index of around 7.95 and finished in 2022 with an index of around 8.9.

  • Fourth cluster: countries that all finished with a slightly lower democracy index in 2022 than in 2006. For example, Myanmar began with an index of around 1.7 and finished in 2022 with an index of around 0.7.

  • Fifth cluster: 7 countries whose index fell at least 0.75 points below its 2006 value and had recovered to at least 0.75 points above that minimum by 2022. For example, Gambia had a democracy index of approximately 4.5 in 2006, reached its lowest point, around 3, in 2016, and recovered to approximately 4.4 by 2022.

  • Sixth cluster: similarly, 10 countries whose index rose at least 0.75 points above its 2006 value and had fallen to at least 0.75 points below that maximum by 2022. For example, Libya had a democracy index of approximately 2 in 2006, reached its highest point, around 5, in 2012, and dropped back to approximately 2 by 2022.

  • Seventh cluster: countries that barely changed from 2006 to 2022, i.e. the difference between their highest and lowest index was less than 0.5 points. For example, Chad had its highest index, 1.67, in 2021 and its lowest index, 1.5, in 2013.

  • Eighth cluster: all the remaining countries, which did not fit into any of the seven clusters above.
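Because each cluster is defined by its own rule, a country can satisfy more than one rule at once. The following is a minimal sketch of how such overlaps could be checked. The data frame `toy`, its four countries and all of its values are made up for illustration, and `index_diff` is assumed to be `X2022 - X2006`.

```r
# Toy trajectories (hypothetical values, not taken from the data)
toy <- data.frame(
  Country   = c("A", "B", "C", "D"),
  X2006     = c(4.5, 2.0, 1.60, 5.0),
  X2022     = c(4.4, 2.0, 1.67, 4.0),
  min_index = c(3.0, 2.0, 1.50, 3.0),
  max_index = c(4.6, 5.0, 1.67, 6.0)
)
toy$index_diff <- toy$X2022 - toy$X2006   # assumed definition of index_diff

# The four rules visible in this section, written as logical columns
rules <- data.frame(
  small_decrease    = toy$index_diff < -0.75 & toy$index_diff >= -1.5,
  decrease_increase = toy$X2006 - toy$min_index >= 0.75 & toy$X2022 - toy$min_index >= 0.75,
  increase_decrease = toy$X2006 - toy$max_index <= -0.75 & toy$X2022 - toy$max_index <= -0.75,
  bare_change       = toy$max_index - toy$min_index < 0.5
)
rownames(rules) <- toy$Country

# Number of rules each country satisfies; a value above 1 means overlapping clusters
n_clusters <- rowSums(rules)
n_clusters
```

In this toy example country D satisfies three rules at once, so it would be plotted in three clusters.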

# Step 1: Select the country name and the 2006 and 2022 scores
data_for_freq <- select(list_by_country,Country,X2006,X2022)

# Initialize empty columns for regime types
data_for_freq$Regime_Type_2006 <- NA
data_for_freq$Regime_Type_2022 <- NA

# Assign regime type based on score conditions for 2006
data_for_freq$Regime_Type_2006 <- ifelse(data_for_freq$X2006 >= 8.01 & data_for_freq$X2006 <= 10, "Full democracies",
                              ifelse(data_for_freq$X2006 >= 6.01 & data_for_freq$X2006 <= 8, "Flawed democracies",
                              ifelse(data_for_freq$X2006 >= 4.01 & data_for_freq$X2006 <= 6, "Hybrid regimes",
                              ifelse(data_for_freq$X2006 >= 0 & data_for_freq$X2006 <= 4, "Authoritarian regimes", NA))))

# Assign regime type based on score conditions for 2022
data_for_freq$Regime_Type_2022 <- ifelse(data_for_freq$X2022 >= 8.01 & data_for_freq$X2022 <= 10, "Full democracies",
                              ifelse(data_for_freq$X2022 >= 6.01 & data_for_freq$X2022 <= 8, "Flawed democracies",
                              ifelse(data_for_freq$X2022 >= 4.01 & data_for_freq$X2022 <= 6, "Hybrid regimes",
                              ifelse(data_for_freq$X2022 >= 0 & data_for_freq$X2022 <= 4, "Authoritarian regimes", NA))))

# Create a contingency table of regime types for 2006 and 2022
data_for_freqs <- table(data_for_freq$Regime_Type_2006 ,data_for_freq$Regime_Type_2022)

# Compute transition probabilities
transition_probabilities <- prop.table(data_for_freqs, margin = 1)
transition_probabilities
##                        
##                         Authoritarian regimes Flawed democracies
##   Authoritarian regimes            0.83636364         0.00000000
##   Flawed democracies               0.01886792         0.69811321
##   Full democracies                 0.00000000         0.23076923
##   Hybrid regimes                   0.36363636         0.15151515
##                        
##                         Full democracies Hybrid regimes
##   Authoritarian regimes       0.00000000     0.16363636
##   Flawed democracies          0.07547170     0.20754717
##   Full democracies            0.76923077     0.00000000
##   Hybrid regimes              0.00000000     0.48484848
# Define the categories for regime types, ordered from most to least democratic
# (the labels must match the ones assigned to Regime_Type_2006 and Regime_Type_2022)
regime_categories <- c("Full democracies", "Flawed democracies", "Hybrid regimes", "Authoritarian regimes")

# Reorder rows (2006 regime) and columns (2022 regime) by name, so every probability keeps its own label
prob_table <- unclass(transition_probabilities[regime_categories, regime_categories])


# Create a heatmap to visualize the regime transition probabilities
# (Rowv/Colv = NA keep the given order; scale = "none" colors the raw probabilities)
heatmap(prob_table, Rowv = NA, Colv = NA, scale = "none",
        col = colorRampPalette(c("white", "green"))(20), main = "Regime Transition Probabilities")

This table shows the probabilities of transitioning from one regime type to another. Each row represents the starting regime type, and each column represents the ending regime type. The values in the table represent the probabilities of transitioning from the starting regime type to the ending regime type. For example:

  • The probability of transitioning from an Authoritarian regime to another Authoritarian regime is 0.83636364.
  • The probability of transitioning from a Flawed democracy to a Full democracy is 0.07547170.
  • The probability of transitioning from a Hybrid regime to an Authoritarian regime is 0.36363636.

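The heat map visualizes these probabilities. From the transition table we can see that a Full democracy has probability 0 of becoming an Authoritarian regime, and an Authoritarian regime has probability 0 of becoming a Flawed or Full democracy. A Flawed democracy has a small chance of becoming Authoritarian (about 0.02), while a Hybrid regime has a much larger one (about 0.36).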
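The classification and the transition table can also be sketched with `cut()` and an explicit ordering of the regime levels, which keeps rows and columns in a meaningful order. This is an illustrative alternative to the nested `ifelse()` calls, not the code used in the report. The six scores are hypothetical, and the breaks assume scores with two decimals, so that `(8, 10]` is equivalent to `8.01` to `10`.

```r
# Hypothetical scores for six countries in two years (not real data)
score_2006 <- c(8.5, 7.0, 7.2, 5.0, 3.0, 2.5)
score_2022 <- c(8.2, 7.5, 5.5, 3.9, 3.5, 2.0)

# Regime thresholds: [0, 4], (4, 6], (6, 8], (8, 10]
regime_breaks <- c(0, 4, 6, 8, 10)
regime_levels <- c("Authoritarian regimes", "Hybrid regimes",
                   "Flawed democracies", "Full democracies")

classify <- function(score) {
  cut(score, breaks = regime_breaks, labels = regime_levels,
      include.lowest = TRUE, right = TRUE)
}

regime_2006 <- classify(score_2006)
regime_2022 <- classify(score_2022)

# Rows: regime in 2006, columns: regime in 2022; margin = 1 makes each row sum to 1
transitions <- prop.table(table(regime_2006, regime_2022), margin = 1)
transitions
```

Since `cut()` returns a factor with the levels in the given order, `table()` keeps that order instead of sorting the labels alphabetically.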

5.a.

gdp_url<- read_html("https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)_per_capita")
population_size_url<- read_html("https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population")
incarnation_rates_url<- read_html("https://en.wikipedia.org/wiki/List_of_countries_by_incarceration_rate")
area_url<- read_html("https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_area")

# Extracting all the tables from HTML data
gdp.tables = html_nodes(gdp_url, "table")  
gdp_table <- as.data.frame(html_table(gdp.tables[2], fill = TRUE))  


pop.tables = html_nodes(population_size_url, "table")  
population_size_table <- as.data.frame(html_table(pop.tables[2], fill = TRUE))  

incar.tables = html_nodes(incarnation_rates_url, "table")  
incarnation_rate_table <- as.data.frame(html_table(incar.tables[2], fill = TRUE))


area.tables = html_nodes(area_url, "table")  
area_table <- as.data.frame(html_table(area.tables[2], fill = TRUE))  

# Renaming columns of the tables so they are more intuitive to use
colnames(gdp_table)[colnames(gdp_table) == "Country.Territory"] <- "Country"
colnames(population_size_table)[colnames(population_size_table) == "Country...Dependency"] <- "Country"
colnames(incarnation_rate_table)[colnames(incarnation_rate_table) == "Location"] <- "Country"
colnames(area_table)[colnames(area_table) == "Country...Dependency"] <- "Country"

# Cleaning country names in the GDP table
gdp_table$Country <- gsub("\\*$", "", gdp_table$Country)
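# Remove non-breaking spaces left next to the footnote markers
# (matching an ordinary space here would also delete the spaces inside names such as "United States")
gdp_table$Country <- gsub("\u00a0", "", gdp_table$Country)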


# Merging tables based on the "Country" column
joined_table <- merge(list_by_country, gdp_table, by = "Country", all.x = TRUE)
joined_table <- merge(joined_table, population_size_table, by = "Country", all.x = TRUE)
joined_table <- merge(joined_table, incarnation_rate_table, by = "Country", all.x = TRUE)
joined_table <- merge(joined_table, area_table, by = "Country", all.x = TRUE)
joined_table <- as.data.frame(joined_table)

# Displaying the head of the joined table
head(joined_table)
##       Country                        Region.x X2022.rank      Regime.type X2022
## 1 Afghanistan            Asia and Australasia        167    Authoritarian  0.32
## 2     Albania      Central and Eastern Europe         64 Flawed democracy  6.41
## 3     Algeria    Middle East and North Africa        113    Authoritarian  3.66
## 4      Angola              Sub-Saharan Africa        109    Authoritarian  3.96
## 5   Argentina Latin America and the Caribbean         50 Flawed democracy  6.85
## 6     Armenia      Central and Eastern Europe         82    Hybrid regime  5.63
##   X2021 X2020 X2019 X2018 X2017 X2016 X2015 X2014 X2013 X2012 X2011 X2010 X2008
## 1  0.32  2.85  2.85  2.97  2.55  2.55  2.77  2.77  2.48  2.48  2.48  2.48  3.02
## 2  6.11  6.08  5.89  5.98  5.98  5.91  5.91  5.67  5.67  5.67  5.81  5.86  5.91
## 3  3.77  3.77  4.01  3.50  3.56  3.56  3.95  3.83  3.83  3.83  3.44  3.44  3.32
## 4  3.37  3.66  3.72  3.62  3.62  3.40  3.35  3.35  3.35  3.35  3.32  3.32  3.35
## 5  6.81  6.95  7.02  7.02  6.96  6.96  7.02  6.84  6.84  6.84  6.84  6.84  6.63
## 6  5.49  5.35  5.54  4.79  4.11  3.88  4.00  4.13  4.02  4.09  4.09  4.09  4.09
##   X2006 min_index max_index UN.Region IMF.5..6. IMF.5..6..1 World.Bank.7.
## 1  3.06      0.32      3.06      Asia     2,456        2020         1,666
## 2  5.91      5.67      6.11    Europe    19,029        2023        15,709
## 3  3.17      3.17      4.01    Africa    13,507        2023        12,128
## 4  2.41      2.41      3.72    Africa     7,222        2023         6,491
## 5  6.63      6.63      7.02  Americas    27,261        2023        23,650
## 6  4.15      3.88      5.54      Asia    19,489        2023        15,593
##   World.Bank.7..1 CIA.8..9..10. CIA.8..9..10..1 Rank.x Population Population.1
## 1            2021         1,500            2021     46 32,890,171       0.409%
## 2            2021        14,500            2021    137  2,793,592      0.0348%
## 3            2021        11,000            2021     32 45,400,000       0.565%
## 4            2021         5,900            2021     44 33,086,278       0.412%
## 5            2021        21,500            2021     31 46,044,703       0.573%
## 6            2021        14,200            2021    134  2,981,200      0.0371%
##          Date Source..official.or.from.the.United.Nations. Notes.x Region.y
## 1  1 Jul 2020                        Official estimate[48]             <NA>
## 2  1 Jan 2022                       Official estimate[134]             <NA>
## 3  1 Jan 2022                        Official estimate[35]           Africa
## 4 30 Jun 2022               National annual projection[46]           Africa
## 5 18 May 2022           2022 census preliminary result[34]             <NA>
## 6  1 Jan 2023             National quarterly estimate[131]             <NA>
##   Count.2. Rate.per.100.000..3. Male.....a. Female.....4. National.....b.
## 1     <NA>                 <NA>        <NA>          <NA>            <NA>
## 2     <NA>                 <NA>        <NA>          <NA>            <NA>
## 3   94,749                  217        98.5           1.5            96.2
## 4   24,966                   79        97.3           2.7               —
## 5     <NA>                 <NA>        <NA>          <NA>            <NA>
## 6     <NA>                 <NA>        <NA>          <NA>            <NA>
##   Foreign.....5. Occupancy.....6. Remand.....7. Rank.y     Totalin.km2..mi2.
## 1           <NA>             <NA>          <NA>     40     652,867 (252,073)
## 2           <NA>             <NA>          <NA>    140       28,748 (11,100)
## 3            3.8             89.3          12.0     10   2,381,741 (919,595)
## 4              —            110.8          45.8     22   1,246,700 (481,400)
## 5           <NA>             <NA>          <NA>      8 2,780,400 (1,073,500)
## 6           <NA>             <NA>          <NA>    138       29,743 (11,484)
##        Landin.km2..mi2. Waterin.km2..mi2. X.water   Notes.y
## 1     652,867 (252,073)             0 (0)       0      <NA>
## 2       27,398 (10,578)       1,350 (520)     4.7          
## 3   2,381,741 (919,595)             0 (0)       0 [Note 13]
## 4   1,246,700 (481,400)             0 (0)       0          
## 5 2,736,690 (1,056,640)   43,710 (16,880)     1.6 [Note 11]
## 6       28,342 (10,943)       1,401 (541)     4.7

The output shows the first 6 rows of the new joined table, with all the necessary information added, such as the incarceration rate and the population of each country. For example, the first country is Afghanistan, and we can see its rank in 2022, its regime type and its democracy index for the years 2006 to 2022.
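Many rows in the joined table contain `NA` for the scraped columns, which happens whenever a country name does not match exactly between two tables. A minimal sketch of how unmatched names could be detected and cleaned before merging is shown below. The tables `left` and `right`, the helper `clean_name` and all values are illustrative assumptions, not part of the report's code.

```r
# Hypothetical tables: `left` plays the role of list_by_country, `right` of a scraped table
left  <- data.frame(Country = c("Algeria", "Angola", "Albania"),
                    X2022 = c(3.66, 3.96, 6.41), stringsAsFactors = FALSE)
right <- data.frame(Country = c("Algeria", "Angola[a]", "Albania\u00a0*"),
                    Rate = c(217, 79, 100), stringsAsFactors = FALSE)

# Names in `left` without an exact match in `right` (these rows would get NA after the merge)
unmatched_before <- setdiff(left$Country, right$Country)

# Normalise scraped names: drop bracketed footnotes, asterisks and non-breaking spaces
clean_name <- function(x) {
  x <- gsub("\\[[^\\]]*\\]", "", x, perl = TRUE)  # footnote markers such as [a]
  x <- gsub("*", "", x, fixed = TRUE)             # asterisks
  x <- gsub("\u00a0", " ", x, fixed = TRUE)       # non-breaking space -> ordinary space
  trimws(x)
}
right$Country <- clean_name(right$Country)
unmatched_after <- setdiff(left$Country, right$Country)

merged <- merge(left, right, by = "Country", all.x = TRUE)
merged
```

Running `setdiff()` before each merge gives a quick list of the names that still need cleaning.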

5.b.

# Removing commas from all columns of the joined table
joined_table <- data.frame(lapply(joined_table, function(x) gsub(",", "", x)))
gdp_table <- data.frame(lapply(gdp_table, function(x) gsub(",", "", x)))


# Renaming specific columns in the joined table
colnames(joined_table)[colnames(joined_table) == "IMF.5..6."] <- "IMF.est"
colnames(joined_table)[colnames(joined_table) == "IMF.5..6..1"] <- "IMF.year"
colnames(joined_table)[colnames(joined_table) == "World.Bank.7."] <- "World.bank.est"
colnames(joined_table)[colnames(joined_table) == "World.Bank.7..1"] <- "World.bank.year"
colnames(joined_table)[colnames(joined_table) == "CIA.8..9..10."] <- "CIA.est"
colnames(joined_table)[colnames(joined_table) == "CIA.8..9..10..1"] <- "CIA.year"

# Creating a new data frame with selected columns and removing rows with NA values
data <- na.omit(joined_table[, c("CIA.est", "X2022")])

# Converting selected columns to numeric
data$CIA.est <- as.numeric(data$CIA.est)
data$X2022 <- as.numeric(data$X2022)

# Performing linear regression using CIA.est as the response variable and X2022 as the predictor
reg_mod_cia <- lm(CIA.est ~ X2022, data = data)
# Displaying the summary of the linear regression model
summary(reg_mod_cia)
## 
## Call:
## lm(formula = CIA.est ~ X2022, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -27148 -11701  -3187   6754  80120 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -6166.1     3551.0  -1.736   0.0844 .  
## X2022         5152.2      609.2   8.457 1.48e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18440 on 163 degrees of freedom
## Multiple R-squared:  0.305,  Adjusted R-squared:  0.3007 
## F-statistic: 71.52 on 1 and 163 DF,  p-value: 1.481e-14
# Plotting the relationship between Democracy Index (X2022) and GDP (PPP) per capita (CIA.est)
plot(data$X2022, data$CIA.est, xlab = "Democracy Index", ylab = "GDP (PPP) per capita")
abline(reg_mod_cia, col = "red")

# Renaming specific columns in the joined table
colnames(joined_table)[colnames(joined_table) == "Rate.per.100.000..3."] <- "Rate.100000"
data1 <- na.omit(joined_table[, c("Rate.100000", "X2022")])

# Converting selected columns to numeric
data1$Rate.100000 <- as.numeric(data1$Rate.100000)
data1$X2022 <- as.numeric(data1$X2022)

# Performing linear regression using Rate.100000 as the response variable and X2022 as the predictor
reg_mod_inca <- lm(Rate.100000 ~ X2022, data = data1)

# Displaying the summary of the linear regression model for Incarceration Rate
summary(reg_mod_inca)
## 
## Call:
## lm(formula = Rate.100000 ~ X2022, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -124.18  -85.37  -36.29   38.46  435.67 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  154.699     36.734   4.211 9.68e-05 ***
## X2022         -3.345      7.320  -0.457     0.65    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 122.7 on 54 degrees of freedom
## Multiple R-squared:  0.003852,   Adjusted R-squared:  -0.0146 
## F-statistic: 0.2088 on 1 and 54 DF,  p-value: 0.6495
# Plotting the relationship between Democracy Index (X2022) and Incarceration Rate (Rate.100000)
plot(data1$X2022, data1$Rate.100000, xlab = "Democracy Index", ylab = "Incarceration Rate")
abline(reg_mod_inca, col="red")

The first plot examines the relationship between the democracy index and GDP (PPP) per capita. There is a clear positive relationship: the fitted slope is about 5,152 Int$ per index point and is highly significant (p < 0.001), although the index explains only about 30% of the variance (R-squared = 0.305) and there are some exceptions. We can conclude that countries with a higher democracy index tend to have a higher GDP per capita. The second plot, which examines the relationship between the democracy index and the incarceration rate, shows no real connection: the regression line is almost horizontal (slope -3.3, p = 0.65, R-squared about 0.004) and the points are widely scattered around it.
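In a regression with a single predictor, R-squared equals the squared Pearson correlation, so the R-squared of 0.305 reported for GDP corresponds to a correlation of about 0.55. The sketch below checks this identity on a small made-up data set; `x` and `y` are hypothetical stand-ins for the democracy index and GDP per capita.

```r
# Hypothetical data: x plays the role of the democracy index, y of GDP per capita
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(3, 2, 6, 5, 9, 8, 14, 11) * 1000

fit <- lm(y ~ x)

r         <- cor(x, y)               # Pearson correlation
r_squared <- summary(fit)$r.squared  # share of the variance explained by the line

# In simple linear regression the two are linked: R^2 = r^2
all.equal(r^2, r_squared)
```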

6.a.

# Find countries that appear more than once
duplicate_countries <- joined_table$Country[duplicated(joined_table$Country)]

# Remove rows with duplicate countries
joined_table <- joined_table[!(joined_table$Country %in% duplicate_countries), ]

# remove all the NA's from the CIA est and convert the data into numeric
x <- as.numeric(na.omit(joined_table$CIA.est))

# Calculate the empirical CDF of X
ecdf_X <- ecdf(x)

plot(ecdf_X, xlab ="x", ylab="ECDF", main= "Empirical CDF", col.main="blue", pch = 16)

This plot shows the empirical CDF of GDP (PPP) per capita for a randomly selected country, where every country in the data has the same probability of being chosen. The x axis shows GDP per capita and the y axis shows the empirical CDF: for each value x, the height of the curve is the fraction of countries whose GDP per capita is at most x.
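As a small worked example of this definition, the sketch below evaluates the empirical CDF on the five CIA estimates shown in the head of the joined table (Afghanistan, Albania, Algeria, Angola, Argentina). The names `gdp` and `F_hat` are introduced only for this illustration.

```r
# CIA estimates of the first five countries in the joined table
gdp <- c(1500, 14500, 11000, 5900, 21500)

F_hat <- ecdf(gdp)

F_hat(11000)        # fraction of these countries with GDP per capita <= 11000
mean(gdp <= 11000)  # the same quantity, computed directly from the definition
```

Three of the five values (1500, 5900 and 11000) are at most 11000, so both lines return 0.6.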

6.b.

# Convert Population and CIA.est to numeric type
joined_table$Population <- as.numeric(joined_table$Population)
joined_table$CIA.est <- as.numeric(joined_table$CIA.est)

# Select the relevant columns and remove missing values
gdp_pers <- na.omit(joined_table %>% select(Country, CIA.est, Population))

# Round the Population column to whole numbers
gdp_pers$Population <- round(gdp_pers$Population)

# Calculate weights based on population
gdp_pers <- gdp_pers %>% mutate(Weights = gdp_pers$Population / sum(gdp_pers$Population))

# Calculate empirical weighted cumulative distribution function (EW-CDF)
ecdf_Y <- ewcdf(gdp_pers$CIA.est, weights = gdp_pers$Weights)

# Plot the empirical weighted cumulative distribution function (EW-CDF)
plot(ecdf_Y, xlab = "GDP (PPP) per capita in Int$", ylab = "EWCDF",
     main = "EW-CDF of GDP per capita of a randomly selected person",
     sub = "(Weighted by Population of a person's country)",
     col.main = "blue", col.sub = "blue", pch = 16, verticals = TRUE, do.points = FALSE)

In this graph we can see that most of the world's population (about 0.8) lives in countries with a GDP per capita of 20,000 or less, while the rest (about 0.2) lives in countries with a GDP per capita between 20k and 60k.
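The weighted CDF can also be built by hand: sort the countries by GDP and accumulate their population shares. The sketch below does this in base R, without `ewcdf`, on three hypothetical countries. All values are made up for illustration.

```r
# Hypothetical countries: GDP per capita and population (illustrative values only)
gdp        <- c(2000, 10000, 50000)
population <- c(600, 300, 100)

# Weighted ECDF: sort by GDP, then accumulate the population shares
ord     <- order(gdp)
weights <- population[ord] / sum(population)
wcdf    <- stepfun(gdp[ord], c(0, cumsum(weights)))

wcdf(10000)        # share of people living in countries with GDP per capita <= 10000
ecdf(gdp)(10000)   # unweighted share of countries, for comparison
```

Here 90% of the people, but only two thirds of the countries, are at or below 10000, because the poorest country is also the most populous.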

6.c.

# Change column name for better readability
colnames(joined_table)[colnames(joined_table) == "Landin.km2..mi2."] <- "LandArea"

# Remove the parenthesised square-mile values from the land area, e.g. "27398 (10578)" -> "27398"
joined_table$LandArea <- gsub("\\s*\\(.*\\)", "", joined_table$LandArea)

# Convert land area values to numeric type
joined_table$LandArea <- as.numeric(joined_table$LandArea)

# Select relevant columns for GDP per capita calculation and remove missing values
gdp_area <- na.omit(joined_table %>% select(Country, CIA.est, LandArea))

# Convert land area values to numeric and round to nearest whole number
gdp_area$LandArea <- as.numeric(gdp_area$LandArea)
gdp_area$LandArea <- round(gdp_area$LandArea)

# Calculate weights based on land area
gdp_area <- gdp_area %>% mutate(Weights = gdp_area$LandArea / sum(gdp_area$LandArea))

# Calculate empirical weighted cumulative distribution function (EW-CDF)
ecdf_Z <- ewcdf(gdp_area$CIA.est, weights = gdp_area$Weights)

# Plot the empirical weighted cumulative distribution function (EW-CDF)
plot(ecdf_Z, xlab = "GDP (PPP) per capita in Int$", ylab = "EWCDF",
     main = "EW-CDF of GDP per capita of a randomly selected person",
     sub = "(Weighted by land-area of a person's country)",
     col.main = "blue", col.sub = "blue", pch = 16, verticals = TRUE, do.points = FALSE)

# Calculate median and percentiles for X, Y, and Z
median_X <- quantile(ecdf_X,probs = 0.5)
median_Y <- quantile(ecdf_Y,probs = 0.5)
median_Z <- quantile(ecdf_Z,probs = 0.5)

percentile_25_X <- quantile(ecdf_X,probs = 0.25)
percentile_25_Y <- quantile(ecdf_Y,probs = 0.25)
percentile_25_Z <- quantile(ecdf_Z,probs = 0.25)

percentile_75_X <- quantile(ecdf_X,probs = 0.75)
percentile_75_Y <- quantile(ecdf_Y,probs = 0.75)
percentile_75_Z <- quantile(ecdf_Z,probs = 0.75)

# Add vertical lines at the quantiles
abline(v = median_Z, col = "red", lty = 2)   # 50% quantile (median)
abline(v = percentile_25_Z, col = "green", lty = 2) # 25% quantile
abline(v = percentile_75_Z, col = "blue", lty = 2)  # 75% quantile

# Create a data frame for comparison
comparison <- data.frame(Variable = c("X", "Y", "Z"),
                         Median = c(median_X, median_Y, median_Z),
                         Percentile_25 = c(percentile_25_X, percentile_25_Y, percentile_25_Z),
                         Percentile_75 = c(percentile_75_X, percentile_75_Y, percentile_75_Z))


comparison
##   Variable Median Percentile_25 Percentile_75
## 1        X  13400          4900         32100
## 2        Y  11900          6600         17600
## 3        Z  17600         11000         41900

In this graph the GDP values are more spread out when countries are weighted by land area: about 0.6 of the land area belongs to countries with a GDP per capita of 30,000 or less, and the remaining 0.4 to countries between 30k and 65k.

The median of X (13,400) is slightly higher than that of Y (11,900) but lower than that of Z (17,600). X gives every country the same weight, so it reflects the diversity among countries, from low-income to high-income. A plausible explanation for the differences is that Y gives most of its weight to a few very populous countries with low to middle GDP per capita, which pulls its median down, while Z gives most of its weight to countries with a large land area, several of which have a relatively high GDP per capita.

The 25th percentile of X (4,900) is the lowest of the three (Y: 6,600, Z: 11,000), while the 75th percentile of X (32,100) lies between those of Y (17,600) and Z (41,900). Y therefore has the narrowest interquartile range: once countries are weighted by population, most of the probability mass is concentrated in a narrow band of GDP values. X includes many small countries at both extremes, which widens its range, and Z is shifted towards higher values overall.
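To see how the weights alone can move the median, the sketch below computes a weighted quantile by hand for three hypothetical countries under equal, population and land-area weights. The helper `weighted_quantile` is an assumption made for this illustration; it uses one common definition (the smallest value whose cumulative weight reaches p), which is not necessarily the exact convention applied by `quantile()` to an `ewcdf` object.

```r
# Smallest value whose cumulative weight share reaches p
weighted_quantile <- function(values, weights, p) {
  ord <- order(values)
  cum <- cumsum(weights[ord]) / sum(weights)
  values[ord][which(cum >= p)[1]]
}

gdp        <- c(2000, 10000, 50000)  # hypothetical GDP per capita
population <- c(600, 300, 100)       # hypothetical population weights
land_area  <- c(100, 200, 700)       # hypothetical land-area weights

median_X <- weighted_quantile(gdp, rep(1, 3), 0.5)   # every country counts equally
median_Y <- weighted_quantile(gdp, population, 0.5)  # weighted by population
median_Z <- weighted_quantile(gdp, land_area, 0.5)   # weighted by land area

c(X = median_X, Y = median_Y, Z = median_Z)
```

The same three GDP values give three different medians (10000, 2000 and 50000), depending only on which countries carry the weight.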

7

#select the country names and from year 2006 till 2022
avg_table <- select(list_by_country,Country,c(5:19))

#create a new empty column to store the average for each country 
avg_table$avg <- NA
row_sums <- rowSums(avg_table[, c(2:16)])
#enter and calculate averages for each country 
avg_table$avg <- row_sums/15

world_map <- joinCountryData2Map(avg_table, joinCode = "NAME", nameJoinColumn = "Country")
## 165 codes from your data successfully matched countries in the map
## 2 codes from your data failed to match with a country code in the map
## 78 codes from the map weren't represented in your data
#The world map with color based on average value
mapCountryData(world_map, nameColumnToPlot = "avg", mapTitle = "Average Value",
               catMethod = "fixedWidth", numCats = 10, missingCountryCol = "white", addLegend = TRUE, oceanCol = "lightblue")

table_2006 <- select(list_by_country,Country,c(19))
mapCountryData(world_map, nameColumnToPlot = "X2006", mapTitle = "2006 Values",
               catMethod = "fixedWidth", numCats = 10, missingCountryCol = "white", addLegend = TRUE, oceanCol = "lightblue")

table_2022 <- select(list_by_country,Country,c(5))
mapCountryData(world_map, nameColumnToPlot = "X2022", mapTitle = "2022 Values",
               catMethod = "fixedWidth", numCats = 10, missingCountryCol = "white", addLegend = TRUE, oceanCol = "lightblue")

In 2006, most of Africa had low democracy scores, with some exceptions, and Russia was orange, in the middle of the scale. America, Europe and Australia were very democratic.

In 2022, Africa had become more democratic (more orange countries), with some exceptions, and Russia had turned yellow, less democratic than it was. America, Europe and Australia stayed almost the same.

We calculated the average democracy score between 2006 and 2022 for each country. We then created a color scale for the scores, from the lowest average (1.06), colored close to white, to the highest (9.83), colored red, and each country received its color according to this scale. Reviewing the results, we can say that the countries on the west side of the map are the most democratic. Countries at the south-east edge of the map and in Europe are very democratic as well. On the other hand, Africa, the Middle East and the remaining eastern and southern countries have low democracy scores (with exceptions).
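The average above divides each row sum by a fixed 15, which is correct as long as every country has a score for all 15 years. Whether any scores are missing in the real table is not shown here. The sketch below, on a hypothetical two-country table, illustrates how `rowMeans()` with `na.rm = TRUE` would behave differently if a year were missing.

```r
# Hypothetical scores for two countries over three years; country B lacks one year
scores <- data.frame(X2022 = c(6.0, 4.0), X2021 = c(7.0, NA), X2020 = c(8.0, 5.0),
                     row.names = c("A", "B"))

avg_fixed <- rowSums(scores) / ncol(scores)  # NA for B: the missing year propagates
avg_mean  <- rowMeans(scores, na.rm = TRUE)  # averages over the years that are present

avg_fixed
avg_mean
```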

8.a.

#Joining the components table in a new variable:
joined_comp <- merge(components,joined_table,by = "Country")

#changing the titles of 5 columns that we'll need :

colnames(joined_comp)[colnames(joined_comp) == "Elec.toral.pro.cessand.plura.lism"] <- "Electoral_processand_pluralism"
colnames(joined_comp)[colnames(joined_comp) == "Func.tioningof.govern.ment"] <- "Functioning_of_government"
colnames(joined_comp)[colnames(joined_comp) == "Poli.ticalpartici.pation"] <- "Political_participation"
colnames(joined_comp)[colnames(joined_comp) == "Poli.ticalcul.ture"] <- "Political_culture"
colnames(joined_comp)[colnames(joined_comp) == "Civilliber.ties"] <- "Civil_liberties"

#Show the first 5
head(joined_comp,5)
##       Country Rank
## 1 Afghanistan  167
## 2     Albania   64
## 3     Algeria  113
## 4      Angola  109
## 5   Argentina   50
##   .mw.parser.output..tooltip.dotted.border.bottom.1px.dotted.cursor.help.Δ.Rank
## 1                                                                              
## 2                                                                             4
## 3                                                                              
## 4                                                                            13
## 5                                                                              
##      Regime.type.x Overall.score Δ.Score Electoral_processand_pluralism
## 1    Authoritarian          0.32                                   0.00
## 2 Flawed democracy          6.41    0.30                           7.00
## 3    Authoritarian          3.66    0.11                           3.08
## 4    Authoritarian          3.96    0.59                           4.50
## 5 Flawed democracy          6.85                                   9.17
##   Functioning_of_government Political_participation Political_culture
## 1                      0.07                    0.00              1.25
## 2                      6.43                    5.00              6.25
## 3                      2.50                    3.89              5.00
## 4                      3.21                    4.44              5.00
## 5                      5.00                    7.78              4.38
##   Civil_liberties                        Region.x X2022.rank    Regime.type.y
## 1            0.29            Asia and Australasia        167    Authoritarian
## 2            7.35      Central and Eastern Europe         64 Flawed democracy
## 3            3.82    Middle East and North Africa        113    Authoritarian
## 4            2.65              Sub-Saharan Africa        109    Authoritarian
## 5            7.94 Latin America and the Caribbean         50 Flawed democracy
##   X2022 X2021 X2020 X2019 X2018 X2017 X2016 X2015 X2014 X2013 X2012 X2011 X2010
## 1  0.32  0.32  2.85  2.85  2.97  2.55  2.55  2.77  2.77  2.48  2.48  2.48  2.48
## 2  6.41  6.11  6.08  5.89  5.98  5.98  5.91  5.91  5.67  5.67  5.67  5.81  5.86
## 3  3.66  3.77  3.77  4.01   3.5  3.56  3.56  3.95  3.83  3.83  3.83  3.44  3.44
## 4  3.96  3.37  3.66  3.72  3.62  3.62   3.4  3.35  3.35  3.35  3.35  3.32  3.32
## 5  6.85  6.81  6.95  7.02  7.02  6.96  6.96  7.02  6.84  6.84  6.84  6.84  6.84
##   X2008 X2006 min_index max_index UN.Region IMF.est IMF.year World.bank.est
## 1  3.02  3.06      0.32      3.06      Asia    2456     2020           1666
## 2  5.91  5.91      5.67      6.11    Europe   19029     2023          15709
## 3  3.32  3.17      3.17      4.01    Africa   13507     2023          12128
## 4  3.35  2.41      2.41      3.72    Africa    7222     2023           6491
## 5  6.63  6.63      6.63      7.02  Americas   27261     2023          23650
##   World.bank.year CIA.est CIA.year Rank.x Population Population.1        Date
## 1            2021    1500     2021     46   32890171       0.409%  1 Jul 2020
## 2            2021   14500     2021    137    2793592      0.0348%  1 Jan 2022
## 3            2021   11000     2021     32   45400000       0.565%  1 Jan 2022
## 4            2021    5900     2021     44   33086278       0.412% 30 Jun 2022
## 5            2021   21500     2021     31   46044703       0.573% 18 May 2022
##   Source..official.or.from.the.United.Nations. Notes.x Region.y Count.2.
## 1                        Official estimate[48]             <NA>     <NA>
## 2                       Official estimate[134]             <NA>     <NA>
## 3                        Official estimate[35]           Africa    94749
## 4               National annual projection[46]           Africa    24966
## 5           2022 census preliminary result[34]             <NA>     <NA>
##   Rate.100000 Male.....a. Female.....4. National.....b. Foreign.....5.
## 1        <NA>        <NA>          <NA>            <NA>           <NA>
## 2        <NA>        <NA>          <NA>            <NA>           <NA>
## 3         217        98.5           1.5            96.2            3.8
## 4          79        97.3           2.7               —              —
## 5        <NA>        <NA>          <NA>            <NA>           <NA>
##   Occupancy.....6. Remand.....7. Rank.y Totalin.km2..mi2. LandArea
## 1             <NA>          <NA>     40   652867 (252073)   652867
## 2             <NA>          <NA>    140     28748 (11100)    27398
## 3             89.3          12.0     10  2381741 (919595)  2381741
## 4            110.8          45.8     22  1246700 (481400)  1246700
## 5             <NA>          <NA>      8 2780400 (1073500)  2736690
##   Waterin.km2..mi2. X.water   Notes.y
## 1             0 (0)       0      <NA>
## 2        1350 (520)     4.7          
## 3             0 (0)       0 [Note 13]
## 4             0 (0)       0          
## 5     43710 (16880)     1.6 [Note 11]
#calculate the correlations between those 

selected_columns <- select(joined_comp,Electoral_processand_pluralism, Functioning_of_government, 
                      Political_participation, Political_culture, Civil_liberties)

selected_columns <- as.data.frame(lapply(selected_columns, as.numeric))

cor_matrix <- cor(selected_columns)

corrplot(cor_matrix, method = "color", type = "upper", 
         tl.col = "black", tl.srt = 45)

The first table shows the first five rows of the new merged data.

In the heatmap of the correlations between the five democracy components we can see that two pairs have the highest correlations (above 0.8): Electoral process and pluralism (EPP) with Civil liberties (CL), and Functioning of government (FG) with Civil liberties. The following pairs also have high correlations: (EPP, FG), (EPP, Political participation (PP)) and (PP, CL). The pairs with the lowest correlations are (EPP, Political culture) and (Political participation, Political culture).

Generally speaking, the darker the color, the stronger the correlation.
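Instead of reading the strongest and weakest pairs off the colors, they can be listed directly from the correlation matrix. The following is a minimal sketch on hypothetical component scores; the data frame `comp`, its three columns and its values are made up, and `pair_table` is a name introduced only here.

```r
# Hypothetical component scores for six countries (illustrative values only)
comp <- data.frame(
  EPP = c(0.0, 3.1, 4.5, 7.0, 9.2, 9.6),
  FG  = c(0.1, 2.5, 3.2, 6.4, 5.0, 8.9),
  PC  = c(1.3, 5.0, 5.0, 6.3, 4.4, 8.1)
)

cor_matrix <- cor(comp)

# Keep each pair once (upper triangle, without the diagonal)
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)
pair_table <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r    = cor_matrix[idx]
)

# Sort from the strongest to the weakest correlation
pair_table <- pair_table[order(pair_table$r, decreasing = TRUE), ]
pair_table
```

Applied to the report's `cor_matrix`, the first and last rows of such a table would give the highest and lowest correlated pairs.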

8.b.

joined_comp <- joined_comp[complete.cases(joined_comp$CIA.est), ]

joined_comp$Electoral_processand_pluralism <- as.numeric(joined_comp$Electoral_processand_pluralism)
joined_comp$Functioning_of_government <- as.numeric(joined_comp$Functioning_of_government)
joined_comp$Political_participation <- as.numeric(joined_comp$Political_participation)
joined_comp$Political_culture <- as.numeric(joined_comp$Political_culture)
joined_comp$Civil_liberties <- as.numeric(joined_comp$Civil_liberties)


model <- lm(joined_comp$CIA.est ~ Electoral_processand_pluralism + Functioning_of_government + 
              Political_participation + Political_culture + Civil_liberties, data = joined_comp)

# Show the summary of the regression analysis
summary(model)
## 
## Call:
## lm(formula = joined_comp$CIA.est ~ Electoral_processand_pluralism + 
##     Functioning_of_government + Political_participation + Political_culture + 
##     Civil_liberties, data = joined_comp)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33113  -9146  -2288   7451  67080 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -15562.8     4635.4  -3.357 0.000985 ***
## Electoral_processand_pluralism  -2969.7      893.8  -3.322 0.001108 ** 
## Functioning_of_government        4857.7     1040.2   4.670 6.37e-06 ***
## Political_participation           624.1     1094.7   0.570 0.569421    
## Political_culture                2668.6      929.5   2.871 0.004647 ** 
## Civil_liberties                  2367.4     1327.2   1.784 0.076384 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16470 on 159 degrees of freedom
## Multiple R-squared:  0.459,  Adjusted R-squared:  0.442 
## F-statistic: 26.98 on 5 and 159 DF,  p-value: < 2.2e-16
# Extract the p-values for each coefficient
p_values <- summary(model)$coefficients[, 4]

# Identify the significant coefficients at alpha = 0.01
significant_coefficients <- names(p_values[p_values < 0.01])

# Print the significant coefficients
cat("Significant coefficients at alpha = 0.01:\n")
## Significant coefficients at alpha = 0.01:
cat(significant_coefficients, sep = ", ")
## (Intercept), Electoral_processand_pluralism, Functioning_of_government, Political_culture
# Calculate the residuals
residuals <- residuals(model)

# Combine the residuals with the country names
residuals_df <- data.frame(Country = joined_comp$Country, Residuals = residuals)



# Sort the dataframe by the absolute value of residuals in descending order
residuals_df <- residuals_df[order(abs(residuals_df$Residuals), decreasing = TRUE), ]

# Display the 5 countries with the largest absolute residuals
cat("Countries with the highest residuals:\n")
## Countries with the highest residuals:
head(residuals_df, 5)
##                  Country Residuals
## 91            Luxembourg  67079.56
## 126                Qatar  65918.29
## 135            Singapore  59621.97
## 73               Ireland  54402.18
## 158 United Arab Emirates  42450.02
# Display the 5 countries with the smallest absolute residuals (best fit)
cat("Countries with the smallest absolute residuals:\n")
## Countries with the smallest absolute residuals:
tail(residuals_df, 5)
##         Country   Residuals
## 113 North Korea  740.521309
## 79       Jordan -642.005483
## 165       Yemen  208.853652
## 147      Taiwan  102.620253
## 28         Chad    6.765322

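The five countries with the largest residuals (in absolute value) are Luxembourg, Qatar, Singapore, Ireland and the United Arab Emirates. All five residuals are positive, meaning that their GDP per capita is much higher than the model predicts.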

The five countries with the smallest residuals in absolute value, i.e. those the model fits best, are North Korea, Jordan, Yemen, Taiwan and Chad.
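
The residuals table above is sorted by absolute value, so its head and tail are the worst-fitted and best-fitted countries rather than the highest and lowest signed residuals. The sketch below illustrates the three different orderings on the built-in `mtcars` data, used here only as a stand-in for the joined data; the model `mpg ~ wt + hp` and the names `res_df`, `under_predicted`, `over_predicted` and `best_fit` are assumptions made for illustration.

```r
# mtcars ships with base R; it stands in for the joined data here.
model <- lm(mpg ~ wt + hp, data = mtcars)

res_df <- data.frame(
  Car = rownames(mtcars),
  Residual = unname(residuals(model))
)

# Largest positive residuals: the model under-predicts these observations.
under_predicted <- head(res_df[order(res_df$Residual, decreasing = TRUE), ], 3)

# Most negative residuals: the model over-predicts these observations.
over_predicted <- head(res_df[order(res_df$Residual), ], 3)

# Smallest absolute residuals: the observations the model fits best.
best_fit <- head(res_df[order(abs(res_df$Residual)), ], 3)

print(under_predicted)
print(over_predicted)
print(best_fit)
```

Sorting by the signed residual separates countries whose GDP per capita is above the prediction from those below it, while sorting by the absolute residual only measures how far off the prediction is.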

Other factors contributing to their high or low GDP per capita could include:

  • Economic policies and governance

  • Natural resources and their management

  • Political stability

  • Education and human capital

  • Infrastructure development

  • Trade and international relations

  • Technological advancements

  • Income inequality and distribution

  • Access to healthcare and social services

  • Environmental factors